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Abstract. In recent years, a variety of extensions and refinements have 
been developed for data augmentation based model fitting routines. 
These developments aim to extend the application, improve the speed 
and/or simplify the implementation of data augmentation methods, 
such as the deterministic EM algorithm for mode finding and stochas- 
tic Gibbs sampler and other auxiliary-variable based methods for pos- 
terior sampling. In this overview article we graphically illustrate and 
compare a number of these extensions, all of which aim to maintain 
the simplicity and computation stability of their predecessors. We par- 
ticularly emphasize the usefulness of identifying similarities between 
the deterministic and stochastic counterparts as we seek more efficient 
computational strategies. We also demonstrate the applicability of data 
augmentation methods for handling complex models with highly hierar- 
chical structure, using a high-energy high-resolution spectral imaging 
model for data from satellite telescopes, such as the Chandra X-ray 
Observatory. 

Key words and phrases: AECM, blocking, collapsing, conditional aug- 
mentation, ECM, ECME, efficient augmentation, data augmentation, 
Gibbs Sampling, marginal augmentation, model reduction, NEM, nest- 
ing. 



1. INTRODUCTION 

Numerous statistical algorithms involving data 
augmentation have enjoyed remarkable popularity 
in the biological, medical, physical, social, engineer- 
ing and other sciences. These algorithms include both 
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deterministic versions such as the Expectation Max- 
imization (EM) algorithm (Dempster, Laird and Ru- 
bin (1977)) and its many extensions and stochastic 
versions such as the Data Augmentation (DA) algo- 
rithm (Tanner and Wong, 1987), the method of aux- 
iliary variables (Besag and Green, 1993) and other 
Markov chain Monte Carlo (MCMC) methods in- 
cluding the Gibbs sampler (Geman and Geman, 1984). 
The popularity of these algorithms rests in their 
suitability for fitting highly structured models (e.g., 
missing data models, latent variable models, hierar- 
chical models, etc.) with high dimensional parame- 
ters. Such models are themselves growing ever more 
popular in modern statistical practice precisely be- 
cause complex data generation mechanisms are of- 
ten naturally defined in terms of unobserved quan- 
tities. This aides inference because the unobserved 
quantities often have a direct physical interpreta- 
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tion and are of scientific interest themselves. From 
a probabilistic point of view, complex correlation 
structures are much more easily described in terms 
of unobserved quantities and the conditional inde- 
pendence structures of hierarchical models. Thus, 
formulating multi-level models in terms of unob- 
served variables enables us to parse complex highly- 
structured data. A primary advantage of algorithms 
involving data augmentation is that even in these 
settings they are relatively easy to implement (as 
illustrated in the spectral model of Section 2) and 
enjoy stable convergence properties (e.g., EM-type 
algorithms exhibit monotone convergence in likeli- 
hood). 

In this paper we review, summarize and compare 
much of the recent work on algorithms involving 
data augmentation, with EM-like algorithms on the 
deterministic side and Gibbs-sampler-type MCMC 
samplers on the stochastic side. This work is pri- 
marily aimed at extending the applicability of the al- 
gorithms and improving their computational speed. 
We focus on methods that build on the statistical 
insight of the algorithms while maintaining their at- 
tractive properties (e.g., simplicity and stability), 



rather than numerical methods that can sacrifice 
these properties. We present basic ideas and con- 
cepts but gloss over much of the technical detail, 
which are documented in the cited references. To 
this end, we include a series of schematic graphic 
representations of the various algorithms that we 
hope can clarify and highlight their relationships, 
especially in visualizing the similarities between the 
deterministic algorithms and their stochastic coun- 
terparts. We begin with two overview schematics. 
Figure 1 describes the relationships among the vari- 
ous EM-type algorithms and Figure 2 describes the 
synergy between the deterministic and stochastic al- 
gorithms that we discuss in this article. 

The paper is organized into seven additional sec- 
tions. As a running example, Section 2 introduces 
a model for Poisson spectral imaging designed to 
analyze data from the Chandra X-ray Observatory 
and similar photon counting devices. Section 3 fo- 
cuses on methods designed to simplify calculation in 
complex models, specifically data augmentation and 
model reduction in the context of both mode-finding 
and sampling algorithms. Section 4 reviews general 
strategies for improving convergence rates such as 
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Fig. 1. A family tree of algorithms inspired by EM. The tree illustrates how various techniques have been combined with the 
basic framework of EM to formulate new algorithms. It should be regarded as a description of the historical inspiration of the 
various algorithms rather than as a hierarchy of generalizations and special cases. The basic stochastic simulation EM-type 
algorithm, known as DA, is described in Section 3.1 and Figure 3. Model reduction and ECM are described in Section 3.3 
and Figure 4- Efficient data augmentation, including CDA-EM and PXEM, is described in Section 5 and Figures 5 and 6. 
It is combined with model reduction to formulate ECME and AECM in Section 6 and Figure 11. The use of Monte Carlo 
integration with and without efficient data augmentation in MCEM and nested EM is discussed in Section 5. 5 and illustrated 
in Figures 8-10. The variance calculations of SEM and SECM are developed in Meng and Rubin (1991) and van Dyk, 
Meng and Rubin (1995), respectively. The arrows illustrate the development and combination of techniques that inspired the 
generalizations of the EM algorithm. 
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Fig. 2. The synergy between EM-type algorithms and their stochastic counterparts. The figure shows the cross-fertilization of 
EM-type algorithms and DA-type samplers. The relationships between EM and DA and between ECM and the Gibbs sample are 
illustrated in Figures 3 and 4, respectively. Marginal data augmentation and PX-DA are described in Section 5.2. The partially 
blocked Gibbs sampler that inspired the nested EM algorithm is illustrated in Figure 8 and the partially collapsed Gibbs sampler 
and its connection with the ECME and AECM algorithms are discussed in Section 6. 



blocking and collapsing. These methods are illus- 
trated in Sections 5 and 6 in the context of nesting, 
conditional augmentation, marginal augmentation, 
joint augmentation and partial collapsing. Finally, 
Section 7 applies some of these methods to the run- 
ning example and Section 8 concludes with a brief 
discussion. 

2. A POISSON SPECTRAL MODEL 

This section briefly outlines a model for spectral 
analysis in astronomy that is designed to summarize 
high-resolution X-ray and 7-ray spectra. The treat- 
ment here is simplified for illustrational purposes. 
Details can be found in van Dyk et al. (2001), Pro- 
tassov et al. (2002), Hans and van Dyk (2003), van 
Dyk and Kang (2004), van Dyk et al. (2006) and 
Park, van Dyk and Siemiginowska (2008). The spec- 
tral model is designed to summarize the relative fre- 
quency of the energy of photons (X-ray or 7-ray) 
arriving at a space-based detector. Because of the 
digital nature of the detector, energies are collected 
as counts in a number of energy bins (e.g., as many 
as 4096 on the detectors aboard the Chandra X- 
ray Observatory). These detectors have much higher 
resolution than their predecessors, and thus smaller 
expected counts per bin. Independent Poisson dis- 
tributions are therefore more appropriate to model 
the counts than the commonly used Gaussian ap- 
proximation. 



Specifically, we model a spectrum as a mixture of a 
"continuum" term and an "emission line." The con- 
tinuum characterizes the electromagnetic emission 
over a broad range of photon energies, while the 
emission line can be viewed as an aberration from 
the continuum in a narrow range of energies. A typ- 
ical spectrum might be composed of multiple con- 
tinua and multiple emission lines. For simplicity, 
we suppose there is only one of each in the model. 
In particular, we parameterize the intensity in bin 
jej = {l,..., J} as 

(1) X J (e) = S j f(d c ,E j ) + up J ( f i,a 2 ), jej, 

where 5j is the known width of bin j, f(0 c , Ej) rep- 
resents the continuum term and is a function of the 
continuum parameter, 8 1 , Ej is the known mean 
energy in bin j, v is the expected photon counts 
corresponding to the emission line, fi and a are the 
center and scale (or rather "width" ) of the emission 
line, and pj(fj,,a 2 ), which is a function of \x and a 2 , 
is the proportion of the emission line counts that 
are expected to fall in bin j. We typically quantify 
Pj(n,a 2 ) via a Gaussian distribution, a t distribu- 
tion or, in the case of a very narrow line, a delta func- 
tion. (These are all standard astronomical approx- 
imations to the distribution of the strictly positive 
photon energies of an emission line.) The collection 
of parameters, 9 C , (v, fi,a 2 ) and 6 A (defined below) 
are together represented by 9. Here we consider two 
simple forms of the continuum f(6 c ,Ej), (1) a log 
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linear model, for example, the power law ^E- , and 
(2) a free (i.e., saturated) model, f(9 c ,Ej) =0?, 
typically including a smoothing prior distribution 
such ELS cl Markov chain for @f',j £ J, for exam- 

pie, ef |0f , . . . , ef_, ~ N(ef_ ls i/w,-) for j = 2, . . . , j, 

where u = (u2, ■ ■ ■ ,oj.j) is a smoothing parameter 
and we assume a flat prior distribution for Of . 

Unfortunately, the photon counts are degraded 
in the observed data. For example, instrument re- 
sponse is a characteristic of the detector that re- 
sults in blurring of the photons, that is, a photon 
that arrives in bin j has probability My of being 
detected in bin i € 1 = {1, . . . ,/}. The I x J ma- 
trix {Mij} is determined by on-going calibration of 
the detector and is presumed known. (Because cali- 
bration can be conducted at higher resolution than 
the binning of the detector, the instrument response 
matrix may not be square.) Another complication is 
absorption, a process by which a proportion of pho- 
tons in a given energy bin are absorbed by matter 
between the astronomical source and the detector. 
This results in stochastic censoring, where the cen- 
soring rate varies with energy. A similar process oc- 
curs in the telescope itself: the detector's effective 
area depends on the energy of the photons. Finally, 
the counts are contaminated by background events. 
Because of these degradations, we model the ob- 
served counts as independent Poisson variables with 
parameters 

J 

(2) = M lJ X j (6)d j g(e A ,E j ) + 9f , i G 1, 

j'=i 

where dj is the (presumed) known effective area of 
the detector for energy bin j as a proportion of the 
total detector area, g(9 A ,Ej) is the probability that 
a photon of energy Ej is not absorbed by matter be- 
tween the source and the detector and 9f is the Pois- 
son intensity of the background, which is generally 
estimated via real-time calibration in space. The ab- 
sorption model, g(8 A ,Ej), may be a (constrained) 
log linear model with A denoting the model pa- 
rameter. Note that Xj(9) in (2) is given by (1). 

How to construct simple, stable and efficient algo- 
rithms for fitting this model is the running example 
for the rest of this article. 

3. STATISTICAL CONCEPTS AND 
COMPUTATION 

The EM algorithm is unique among common nu- 
merical optimization routines in that it is primar- 



ily formulated in statistical rather than mathemat- 
ical terms. The missing data setup, the Expecta- 
tion step and the complete-data computations of 
the Maximization step of EM stand in contrast, for 
example, to the derivatives and local linearization 
of the Newton-Raphson algorithm. Other EM-type 
optimizers and their related stochastic samplers ex- 
tend this in that their motivation and implementa- 
tion rely heavily on statistical concepts and insight. 
In this section we discuss two such concepts: data 
augmentation and model reduction. We show how 
their effective use of the divide-and-conquer strat- 
egy of reducing a complex problem into an iterated 
sequence of simpler ones has led to a rich class of 
statistical algorithms. 

3.1 Data Augmentation 

Computational methods based on data augmen- 
tation are generally applied to posterior distribu- 
tions or likelihood functions. Here we generally take 
a Bayesian perspective, but are mindful of the fact 
that for computational purposes a likelihood func- 
tion is equivalent to a posterior density under a con- 
stant prior distribution. Thus, the object of study 
can be written as 

(3) p(9\Y° hs ) = |p(M|Y° b8 )M#), 

where Y ohs is the observed data, fi is a common mea- 
sure such as a Lebesgue or counting measure, 9 is 
the unobserved quantity of primary interest, and (ft 
includes nuisance parameters, latent variables, miss- 
ing data or any other unobserved quantity of sec- 
ondary interest. Embedding p(9\Y obs ) into a model 
on a larger space such as p(9, </>|Y obs ) in this way 
is called the method of data augmentation. This 
method can be used to either compute the mode 
of 9 under the marginal distribution given in (3) or 
to obtain a sample from (3) which in turn can be 
used to approximate the posterior mean, variance, 
quantiles, etc., via Monte Carlo simulation. 

In the spirit of the EM literature, we use a more 
inclusive notation Y au § in place (ft, where Y au s is 
called the augmented data and represents the com- 
bination of y obs and any latent variables or missing 
data. The target posterior distribution can be ex- 
pressed as 

(4) p(9\Y° hs )K P (Y ohs \9)p(9), 

where p{9) is a prior distribution and p(Y obs \9) yields 
a likelihood. In this way, data augmentation meth- 
ods can be viewed as embedding (4) into a larger 
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augmented data model, via 

(5) [ p(Y^\6)fi(dY^)=p(Y obs \9), 

where A4 is some many-to-one mapping from Y aug 
to y obs . Using the factorization 

(6) p(Y aus \8) =p(Y aus \Y ohs ,6)p(Y obs \6), 

we recognize that (5) can be maintained with any 
choice of p(y aug |y obs , 9), that is, as long as p(Y au ^\9) 
yields the correct marg inal distribution p(Y ohs \9). 
In some cases we can use this flexibility to intro- 
duce artificial augmented data purely for computa- 
tional reasons. Thus, we can choose p{Y au ^\Y obs ,9) 
in order to optimize or improve computational per- 
formance rather for statistical modeling, as we shall 
discuss in Section 5. 

Data augmentation can lead to useful algorithms 
if the conditional distributions, p(Y an& \Y ohs ,9) and 
p(9\Y aug ), are easy to work with (e.g., to sample, 
maximize and/or compute expectations). Thus, a 
useful choice of an augmented data model specifies 
a division of a model into two simpler conditional 
models which are typically much easier to analyze. 

The EM algorithm computes a posterior mode 
using the conditional distributions via the familiar 
two-step iteration, consisting of 



E-step: Compute 

Q{6\9^)=E{logp(9\Y aus )\Y ohs ,0(% 

M-step: Set 6>(* +1 ) = argmax Q{0\0®), 

where the parenthetical superscript t indexes the it- 
eration. This iteration is known to increase p(9\Y obs ) 
and converges to a stationary point of p(9\Y obs ) 
that is generally, but not always, a (local) mode of 
p(9\Y ohs ) (Dempster, Laird and Rubin, 1977; Wu, 
1983; Vaida, 2005). The two steps of this iteration 
give EM its name, that is, the Expectation or E-step 
and the Maximization or M-step. 

The Data Augmentation (DA) algorithm of Tan- 
ner and Wong (1987) replaces the two steps of the 
EM algorithm with two sampling steps, each sam- 
ples one of two full conditional distributions: 

Step 1: (y au s)(*+!) ~p(Y au z\Y ohs ,9^), 
Step 2: ^ p(9\(Y au ^ t+1 ^). 

This iteration produces a Markov chain, {9^\t = 
1,2,...}, which under mild regularity conditions has 
the desired stationary distribution, p(9\Y ohs ) (see 
Roberts, 1996; Tierney, 1994, 1996, for convergence 
results). The EM and DA algorithms are compared 
in Figure 3. In all of the figures in this article, condi- 
tioning on y obs is suppressed, and hexagons, circles 
and squares (or their elongated versions) represent 
expectation steps, (conditional) maximization steps 
and random draws, respectively. 




Fig. 3. The EM (left panel) and DA (right panel) algorithms. In the maximization step of EM, we compute 9 to maximize 
the conditional expectation of logp(#|Y aug ), with the expectation computed in the expectation step; see Section 3.1. 
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Table 1 

Data augmentation in the spectral model. For all variables, j £ J , i £ I, and s £ S, where J indexes the ideal bins, I 

indexes the detector bins, and S indexes the sources 



Level 


Variable 


Notation 


Range 


1. 


The ideal data: no blurring, binning, background 


{Y C ,Y L } 


Positive, keV 6 




contamination, absorption" or mixing of sources 






2. 


The binned ideal counts 


{YfX L } 


Counts 


3. 


The binned ideal counts after absorption 




Counts 


4. 


The mixed and binned ideal counts after absorption 


n 


Counts 


5. 


The mixed, binned and blurred ideal counts 




Counts 




after absorption 






6. 


The mixed, binned and blurred ideal counts 


■w"obs 


Counts 



after absorption and background contamination, 
this is, the observed data 

a In the statistical model the effective area of the instrument is handled in exactly the same way as absorption. Thus, in this 

table, absorption includes the effective area of the detector. 

'The ideal data are the photon energies measured in kiloelectron volts (keV). 



3.2 Data Augmentation in the Spectral Model 

Table 1 lists a hierarchy of augmented data struc- 
tures used to construct EM and DA algorithms for 
fitting the spectral model described in Section 2. 
In the notation of Table 1 more dots in the ac- 
cent above a variable represent greater degrees of 
augmentation; variables with fewer dots are (some- 
times stochastic) functions of those with more dots. 
The set S is the collection of photon sources, here 
simply S = {C,L}, where C represents the contin- 
uum and L the emission line. The superscript on 
"y" represents the photon source; a "+" in the su- 
perscript indicates a mixture of both sources. 

Reading top-to-bottom in Table 1, the relation- 
ships among the variables follows. The vectors 
Y and Y contain the exact energies of photons 
attributed to the continuum and emission line, re- 
spectively. Because photon arrivals follow a Poisson 
process, the length of both of these vectors are Pois- 
son variables; the length of Y L has expectation v. 
These energies are binned and the resulting counts 
recorded as Y s = (Y^ , . . . ,Yj), for s € S. Absorp- 
tion and the varying effective area of the instrument 
cause an energy-varying proportion of these counts 
to be lost. In particular, 

Yf\Y?,0~ Bmomial(Y/, djg(9 A , Ej)), 

(7) 

j € J, s € S. 

For the observer, the continuum and emission line 
counts are combined, = Yp + Yj" for each j. 
Blurring, due to instrument response, shuffles pho- 



tons among the bins and into the observed bin counts 
via 

J 



(8) 



Y+IY+, 



^2 Multinomial^, Mj), 

3=1 



where Y+ = (Y+,..., Y}+), Y + = . . . ,Yj~), and 
Mj is the jth column of M, j S J . Because M may 
not be a square matrix, the lengths of Y + and Y + 
may differ. Finally, background contamination leads 
to the observed bin counts, 



(9) Y° hs \Y i + ,6~Y i + + Poisson(0f), *G x - 

This augmented-data construction leads to easy 
implementation for two reasons. First, each level 
of augmented data follows a standard distribution 
given 9 and the data in the rows lower in Table 1. 
Reading Table 1 bottom-to-top, each conditional dis- 
tribution can be derived using the Bayes theorem. 
For illustration, we report the details of just two: 

^obs n _ TJ4 „;„1 / -i^obs 



(10) 



Y+|Y 4 obs ,#~ Binomial (Y; c 



where £i(0) is defined in (2), and 

(11) y/|F/,0~Y/+Poisson(^), jej, 

where r]j = vpj{[i,,a 2 )(l — djg(9 A , Ej)). The other 
necessary conditional distributions can be found in 
Appendix B of van Dyk et al. (2001). Thus, the 
E-step of EM and the corresponding draw of DA 



CROSS-FERTILIZING STRATEGIES FOR EM ALGORITHMS AND DA SAMPLERS 



7 



are straightforward. Second, given the data in Ta- 
ble 1, the posterior distribution of 9 is a set of inde- 
pendent standard distributions. For example, given 
Y L '~ ' N(/i,cr 2 ), it is easy to compute the posterior 
distribution of (u,^,a 2 ), recalling that the length 
of Y is a Poisson random variable with mean v. 
The posterior distributions of the other components 
of 9 are also standard and simple to derive. Thus, the 
M-step of EM and the corresponding draw of DA are 
again easy to implement. Incorporating proper prior 
information can be accomplished using the appropri- 
ate semi-conjugate prior distributions as described 
in van Dyk et al. (2001). 

3.3 Model Reduction 

Model reduction involves using a set of (typically 
complete) conditional distributions in a computa- 
tion method designed to learn about the correspond- 
ing joint distribution. Reducing the augmented-data 
model significantly broadens the applicability of al- 
gorithms involved in data augmentation, while main- 
taining their stable convergence properties (e.g., 
Meng and Rubin, 1993). In particular, if we parti- 
tion 9 into P subvectors, 9 = (9%, . . . ,6p), reducing 
the augmented data model involves working with 
the set of conditional distributions p{9\ |y aug , 9-\), 
. . . ,p(9p\Y ans ,9^p) in place of directly working with 



p{9\Y^); here 0_ p = (9 U . . . , 9 p+1 , . . . ,9 P ). For 
example, the ECM algorithm (Meng and Rubin, 1993) 
replaces the maximization in the M-step of EM with 
a sequence of P conditional maximizations or CM- 
steps of the form 

CM-step p: Set t?(*+P/ p ) = argmaxg Q{9\9®) 

subject to 9 { ^ p/P) = 9^ (p ~ 1)/P) . The ECM algo- 
rithm is useful when the CM-steps exist in closed 
form but the M-step does not. ECM is illustrated 
with P = 2 in the left panel of Figure 4. 

The same strategy can be applied to the DA sam- 
pler. By replacing the draw from p(9\Y aug ) with a 
sequence of draws from the corresponding full condi- 
tional distributions, the sampler becomes a (P + 1)- 
step Gibbs sampler. This sampler is illustrated in 
the right panel of Figure 4. In the context of sam- 
pling, we can also reduce p(Y a,ns \9) into a set of con- 
ditional distributions. Partitioning the expectation 
step, however, has proven much more illusive. One 
strategy involves using the law of iterated expecta- 
tions in the computation of the E-step and results 
in the Nested EM algorithm; see Section 5.5. 

Rather than using a partition of 9, a more gen- 
eral model reduction scheme updates 9 by condi- 
tioning on a sequence of functions of 9. It is only 
required that the functions allow movement any- 
where in the parameter space, that is, the functions 




p(Y aus \9) 



K0i|02,K aug ) 




p(9 2 \9 u Y^) 





Expectation Step 



Conditional 
Maximization Step 



Random Draw 



Fig. 4. The ECM algorithm and the Gibbs sampler. The left panel shows a three-step ECM algorithm composed of an E-step 
and two CM-steps. The corresponding Gibbs sampler is illustrated in the right panel and is composed of three steps including 
a data augmentation step. In the conditional maximization steps of ECM, we compute the component of 6 to maximize 
the conditional expectation of the log of the quantity in the Q, with the expectation computed in the expectation step; see 
Section 3.3. 
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are "space- filling" as described by Meng and Ru- 
bin (1993). Again, the same strategy can be used in 
sampling algorithms, such as the Bayesian IPF sam- 
pler used to fit constrained models on contingency 
tables (Schafer, 1997; Gelman et al., 2003). Recent 
work by Yu and Meng (2010) further explores the 
use of this strategy to improve MCMC algorithms 
by employing a sequence of sufficient and auxiliary 
data augmentation schemes that are space filling. 

3.4 Model Reduction in the Spectral Model 

To illustrate model reduction in an augmented 
data model, we consider the second form of the con- 
tinuum model, namely, the free model f(9 c ,Ej) = 
Of with a Markov-chain- type smoothing prior Of \Q\, 
. . . , ru N(flJL l5 1/ooj) for j = 2, . . . , J, where u = 
(tj2, ...,ujj) is a smoothing parameter and we as- 
sume a flat prior for 0^ . For simplicity, we assume 
there is no emission line and that Sj = g(0 A , Ej) = 1 
for each j, that is, the bins are of the same size 
and that there is no absorption. In this case, we use 
only rows 4-6 of Table 1 in our data augmentation 
scheme to derive 

J 

Q(e\e®) = ^[E(yf|y obs , o®) io g of - of] 

(12) 

-\j:^f-eUf. 

3=2 

Once we have computed the expectation in (12), 
we need only optimize Q(9\9®) as a function of 
9. Unfortunately, this optimization cannot be done 
analytically when some ojj > 0. However, the par- 
tial derivative of Q(9\9®) with respect to Of is a 
quadratic function of 9f if we fix 0_ .. Thus, as is dis- 
cussed by Fessler and Hero (1995) and is improved 
in Section 7, we can construct an ECM algorithm 
with J CM-steps of the form 

= max jo, (Bj + ^B? + A J -E(y/ 7 |y° bs ,0(*))) 

where 

Aj = ojj + 0Jj+i and 

Bj = _ Wi+1 (ef +l) (*))/2. 



4. IMPROVING RATES OF CONVERGENCE 

EM-type algorithms and their stochastic counter- 
parts have seen many applications largely because of 
their computational stability and simple implemen- 
tation. Nonetheless, these methods are legitimately 
criticized for their slow convergence in some set- 
tings. Strong posterior correlations among the com- 
ponents updated in each step lead to full condi- 
tional distributions that are far less variable than 
the corresponding marginal distributions. This in 
turn leads to smaller step sizes and slower progress 
toward the mode or toward the stationary distri- 
bution. Much work has been focused on develop- 
ing algorithms with improved rates of convergence 
that continue to enjoy the simplicity and stability 
that makes data augmentation so useful in practice. 
As we shall see with both data augmentation and 
model reduction, less is better if one hopes for speed, 
while more is often better if one hopes for simplicity. 
In this section we discuss the sometimes conflicting 
strategies for improving the computational perfor- 
mance of methods based on data augmentation. 

4.1 The EM and DA Rates of Convergence 

Before we can develop criteria for speeding up 
data augmentation methods, we need mathemati- 
cal measures of their rates of convergence. For EM, 
such a measure is given by pem, the spectral radius 
of the so-called matrix fraction of missing informa- 
tion (Dempster, Laird and Rubin, 1977), 

(13) j_ I obs[jaug^aug)]-l ) 

where I is an identity matrix, I obs is the observed 
Fisher information matrix, and J au g(y au s) = 
-d 2 Q{9\9*)/{89 d0)\ e=e * with 0* the posterior mode; 
our notation for J aug emphasizes that both Q{9\9') 
and the augmented-data information matrix depend 
on the choice of augmented data model. Here we 
use the traditional terms (e.g., Fisher information) 
of the EM literature, which primarily focus on like- 
lihood calculation, even though we are dealing with 
the more general posterior computation. In partic- 
ular, I obs is the negative of the second derivative of 
the log posterior density evaluated at the posterior 
mode. 

We call /Jem the global rate of convergence and I — 
7 obs (7 aug ) _1 the matrix rate of convergence of the 
EM algorithm. More general formulations of the rate 
of convergence for ECM and other EM-type algo- 
rithms are given by Meng and Rubin (1993, 1994), 
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Meng (1994), Meng and van Dyk (1997) and van 
Dyk (2000b). For the EM algorithm, our goal is to 
minimize pem as a function of the data augmenta- 
tion scheme. 

For the DA algorithm, the geometric rate of con- 
vergence (Amit, 1991) is 

(14) 1- inf E[Var(/i(fl)|y aug )|Y obs l. 

h:Var(h(6»)|y obs )=l 

Although this quantity and the maximum lag one 
autocorrelation (Liu, 1994) are valuable for theo- 
retical calculations, they are generally difficult to 
work with analytically in particular models. The 
EM-approximation of van Dyk and Meng (2001) is 
essentially based on a Gaussian approximation to 
the posterior distribution and simply replaces these 
quantities by pem- Van Dyk and Meng (2001) illus- 
trate that this approximate EM criterion can lead 
to substantial improvements in DA samplers. Thus, 
one of our basic strategies is to focus on methods 
that reduce pem with an understanding that such 
methods are useful in formulating efficient data aug- 
mentation schemes for both deterministic and sto- 
chastic algorithms. 

4.2 Blocking and Collapsing 

As the formulations of the matrix rates of con- 
vergence for more complex EM-type algorithms in 
the above cited articles illustrate, analysis of con- 
vergence is significantly more complex with multi- 
step algorithms. In the analysis of DA and Gibbs 
samplers, the spectral radius and the norm of the 
forward operator are useful measures of the con- 
vergence behavior of a Markov chain (Liu, Wong 
and Kong (1994); Liu, 2001). Based on these mea- 
sures, Liu, Wong and Kong (1994) introduced two 
strategies that have emerged as important general 
techniques for improving the behavior of Gibbs-type 
samplers. 

To illustrate these techniques, consider a P-step 
sampler that simulates each component of = (0\ , . . . 
Op) in turn conditioning on the most recently sam- 
pled values of the other P — 1 components of 0. 
The first strategy, known as blocking, involves com- 
bining two or more draws into a single draw. For ex- 
ample, the last two steps could be combined into a 
single draw of (0p-\,0p) given the other P — 2 com- 
ponents of 0. Collapsing, on the other hand, involves 
the construction of a sampler on a subspace of the 
original sampler. For example, we might compute 
the marginal distribution of 9-p by integrating out 



9p and construct a (P — l)-step sampler using the 
full conditional distributions of the first P — 1 com- 
ponents of the original partition of 9. Each of these 
components is updated conditioning on the most re- 
cently sampled values of the other P — 2 components 
of 6Lp to construct a Markov chain with station- 
ary distribution equal to the marginal distribution 
of 

Liu (2001) shows that both of these strategies are 
expected to improve the convergence behavior of the 
original P-step sampler in that they reduce the norm 
of its forward operator. (For Gibbs samplers with 
more than two steps, the norm may not be equal 
to the rate of convergence of the Markov chain.) He 
also showed that collapsing reduces the norm by at 
least as much as blocking. Thus, good general ad- 
vice is to collapse whenever possible, and to block if 
you can when collapsing is not possible. Liu's tech- 
nical results apply only when blocking is applied to 
the last steps of each iteration of a Gibbs sampler 
and/or when the subparameter sampled in the last 
step is collapsed out of the sampler, as we discussed 
for illustration in the previous paragraph. Nonethe- 
less, experience shows that both strategies are more 
generally useful and should be implemented when- 
ever feasible. 

Analogous advice applies to EM-type algorithms. 
In the comparison of the EM and ECM algorithms, 
blocking suggests that fewer CM-steps should be 
preferred and that the ECM algorithm is expected 
to converge more slowly than the corresponding EM 
algorithm. While this is good general advice, it does 
not always hold mathematically; Meng (1994) gives 
a simple example in which ECM outperforms EM. 
We emphasize that the motivation of ECM, how- 
ever, is not faster convergence but easier implemen- 
tation. We generally consider ECM when the M-step 
of EM is not tractable and, thus, the EM algorithm 
itself is not feasible. 

Collapsing is also a useful strategy in the context 
of EM-type algorithms. The next section is devoted 
to methods that aim to reduce the information in 
y aug and thus effectively collapse a portion of Y au § 
out of the iteration. Section 6 describes intermediate 
strategies that allow partial collapse when full col- 
lapse is not possible, as in the ECME and AECM 
algorithms. 

In the context of EM, we can sometimes also col- 
lapse 9 via a profile loglikelihood. Suppose that 9 = 
(01,02) and that we are able to compute the pro- 
file likelihood l(0r,Y ohs ) = £(0 1 ,0 2 (0i,Y obs )\Y obs ), 
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where 6 2 (&i, Y obs ) is the maximizer oi £(e 1 ,e 2 \Y obs ) 
when 6\ is fixed. There are two ways to construct 
an EM algorithm in this situation. The first way 
is to construct a data augmentation, Y au g, to im- 
plement EM for the full parameter 6 = (61,62) via 
the full augmented data loglikelihood, i(6\, # 2 |Y aug ). 
That is, we do not take advantage of the poten- 
tial computational gain of using the profile likeli- 
hood. The second way is to construct a data aug- 
mentation, y au §, to augment the profile likelihood 
£(6i;Y ohs ) and then implement the EM algorithm 
for the subparameter 6\ only. Note that here we use 
the notation £(6y,Y obs ) rather than l((9i|Y obs ) to 
emphasize that £(6r,Y ohs ) may not necessarily be 
a proper loglikelihood in the sense of being derived 
from a log density or probability of Y obs . We can 
nonetheless use EM, because it is possible to con- 
struct an EM algorithm for maximizing any objec- 
tive function D(6;Y obs ) as long as we can find an 
augmented objective function Z)(#;Y aus ) such that 
exp{L>((9; Y au §) -D(6; Y obs )} is a proper conditional 
density function of Y ang given 6 and Y obs ; see the 
rejoinder of Meng and van Dyk (1997) for more dis- 
cussion on this flexibility of EM. Therefore, it is pos- 
sible to use EM for the profile likelihood by treating 
£(6\;Y obs ) as an objective function. This collapsing 
through profiling has not been generally recognized, 
but can significantly improve the speed, when com- 
pared to the first way of directly applying the EM 
algorithm to the full likelihood. See Meng (1997) 
for more discussion and an example involving a zero 
inflated Poisson model. 

5. EFFICIENT DATA AUGMENTATION 

Inherent in the definition of the augmented data 
model is a choice: There are infinitely many aug- 
mented data models satisfying (5). In this section 
we discuss various criteria for this choice that re- 
sult in efficient algorithms. By "efficient data aug- 
mentation" we mean using augmentation schemes 
that improve speed, while maintaining stability and 
simplicity. Here we discuss techniques that are able 
to achieve all three criterion: They reduce the aug- 
mented data in the construction of the algorithm to 
improve speed while maintaining stability and sim- 
plicity. 

The basic idea is similar to collapsing in the Gibbs 
sampler. Suppose that an EM algorithm or a data 
augmentation sampler can be constructed with a 
baseline data augmentation scheme that we denote 



Y au s. Further suppose that Y aug = Y aug U Y" 2 aug , whe- 
re both Y^ ug and Y 2 aug are legitimate data augmen- 
tation schemes in that they both contain Y obs . It 
is easy to show that / au g(y au g) > / a ug(y au g) [i.e., 
that J au g(y au g) _ / a ug(y au g) i s semi-positive defi- 
nite] and that E[Var(/i(60|Y aug )|Y obs ] < E[Vav(h(6)\ 
Y^ ns )\Y obs ], where h(-) is any real-valued function, 
the first expression being an asymptotic variant of 
the second (Meng and van Dyk, 1999). Thus, by (13) 
and (14), construction of an alternate algorithm us- 
ing only Yi g as the augmented data results in faster 
convergence. This strategy effectively collapses Y aug \ 
Y aug out of the algorithm. We will discuss direct ap- 
plications of this idea when we discuss the nesting 
strategy in Section 5.5. Less direct applications are 
the topic of Sections 5.1-5.3. The methods described 
in these sections do not directly decompose Y aug 
into two components but still aim to either reduce 
jaug^yaug) Qr to increase E[Var (h(6) |Y aug ) |Y obs ] . 

5.1 Conditional Augmentation 

The methods of conditional, marginal and joint 
augmentation all take advantage of the flexibility 
in (5) to introduce less informative augmented data 
in order to construct a more efficient algorithm. To 
search for a good augmented data model using any 
of the three methods, we begin by parameterizing 
the augmented data model using a working parame- 
ter. We define a working parameter to be a parame- 
ter in the augmented data model that is not identi- 
fiable under the observed data model, p(Y ohs \6). In 
particular, we generalize (5) via 



(15) 



.A/f(yaug) = y-obs 

:p(Y° bS \6) 



p(Y^\6,a)p(dY^) 



for all a in some class A. Notice that the right-hand 
side of (15) does not depend on the working pa- 
rameter. An effective method of introducing a is to 
let Y aug = T> a>e (Y an ^), where V afi is a one-to-one 
mapping for any 6 and a & A and Y aug is the base- 
line augmented data. Typically Y aug is the standard 
augmented data used to construct EM-type algo- 
rithms or samplers for fitting a particular model. In 
the context of the EM algorithm, we can compute 
the scalar rate of convergence, pem(o:), for each a. 
Conditional augmentation simply optimizes pem(«) 
as function of a and then conditions on the optimal 
value of a throughout the iteration. Meng and van 
Dyk (1997) call an EM algorithm constructed with 
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set 

Ci" = argminp EM (a) 




(16) 



A^(y au s)=y° bs 



p(Y aug \9,a)p(da) 



/j(dY aug ) 



-p(Y obs \6). 



Expectation Step 



Maximization Step 



Fig. 5. The EM algorithm constructed with conditional data 
augmentation. [In the maximization step we compute 9 to 
maximize the conditional expectation o/logp(6|V aus , a*), with 
the expectation computed in the expectation step. Here we use 
the superscript "a" as an abbreviation for "aug" or "aug- 
mented. "J 



the resulting optimal data augmentation scheme an 
efficient data augmentation EM algorithm. For clar- 
ity, we refer to it here as a conditional data augmen- 
tation EM algorithm or CDA-EM. Although this 
choice of augmented data model is based on the EM 
rate of convergence, the same model can be used to 
construct data augmentation samplers. This is an 
example of the approximate EM criterion discussed 
in Section 4.1. 

It is worth noting that the optimization required 
by conditional augmentation occurs as part of the 
derivation of the algorithm. The value a is fixed 
when we run the algorithm; see Figure 5. The meth- 
ods of marginal and joint augmentation, on the other 
hand, avoid this initial optimization problem by av- 
eraging over or fitting a on the fly, and, more im- 
portantly, they can lead to better algorithms. 

5.2 Marginal Augmentation 

Marginal augmentation also begins with (15), but, 
in addition to a working parameter, introduces a 
working prior distribution, p(a). The working prior 
distribution is typically chosen so that a and 6 are 
independent, so that 



Note that if we define the resulting augmented data 
model as p(Y^\9) = J p(Y aug |#, a)p(da), we obtain 
fp(Y m «\e)ii(dY m «) =p(Y obs |0). Thus, (16) results 
in a legitimate data augmentation scheme. [Marginal 
augmentation was introduced by Meng and van Dyk 
(1999) and is very closely related to the PX-DA sam- 
pler of Liu and Wu (1999).] 

This strategy is motivated by a desire to reduce 
the information in Y ang for 6. Since conditioning 
tends to increase information, marginalization may 
be advantageous. In particular, for any function h(-), 
we have 

E[Var(/ i (^)|y aug )|y obs ] 

(17) =E[E[Var(/i(6>)|Y aug ,a)| Y obs , a] \ Y ohs ] 

+ E[Var [E(h(6) | Y aug , a) | Y aug ] | Y obs ] . 

If p(Y aug |<9,a) is generated by Y aug = £> a (Y aug ) us- 
ing the baseline augmentation, Y aus , then E[Var(/i(^)| 
Y aug ,a)|Y obs ,a] does not depend on a and (17) im- 
plies 

E[Var(/i(fl)|Y aug )|Y obs ] 

>E[Var(/i(#)|Y aug ,a)|Y obs ,a] 

for any a, and, thus, in terms of the geometric rate, 
marginal augmentation is superior to conditional aug- 
mentation (Meng and van Dyk, 1999). This result, 
however, depends on the working parameter being 
introduced via Y aug = V a (Y aug ), a transformation 
depending only on a. When the transformation de- 
pends on the model parameters as well, conditional 
augmentation can be superior. See Meng and van 
Dyk (1999) or Liu and Wu (1999) for details. 

Although there is no need to choose a when using 
marginal augmentation, we are left with the choice 
of working prior distributions. One strategy for choo- 
sing p{a) (van Dyk and Meng, 2001) suggests pa- 
rameterizing the working prior, p(a\ip), and chooses 
ip as a level-two working parameter via a conditional 
augmentation criterion. Liu and Wu (1999) show 
that, under certain conditions, the Haar measure 
leads to an optimal algorithm with the correct sta- 
tionary distribution. In general, however, using an 
improper working prior distribution may not even 
lead to the correct stationary distribution, let alone 
optimality; see Meng and van Dyk (1999), van Dyk 
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and Meng (2001) and van Dyk (2009). When it ex- 
ists, the use of the Haar measure typically leads 
to a joint chain on the enlarged space (a, 6, y au s) 
that is nonpositive recurrent, but the marginal chain 
on the original space converges properly to the 
desired posterior distribution p(6\Y ohs ); see Hobert 
(2001), Marchev and Hobert (2004) and Hobert and 
Marchev (2008) for additional discussion. 

5.3 Joint Augmentation 

There is no known easy way to implement EM- 
type algorithms that use marginal augmentation. 
A similar strategy, however, uses the augmentation 
scheme (15), but rather than optimizing pem as a 
function of a before running the algorithm or margi- 
nalizing a out as in (16), this method fits a jointly 
with 6 in the M-step. In particular, Liu, Rubin and 
Wu (1998) presents the PXEM algorithm fast 
adaptation of conditional augmentation in the con- 
text of the EM algorithm in the case when p{6) oc 1, 
for example, in maximum likelihood estimation. Van 
Dyk (2000a) slightly extended the framework to the 
Bayesian case, by defining 

Q px (9,a\9',a ) 

= J log[p(Y^\e,a)p(9)} 

■ P (Y aus \Y obs ,0',a o )dY &us . 

As illustrated in Figure 6, the PXEM iteration sets 
(^(*+ 1 ) ) equal to the maximizer of Q px (8, 

a\6^ , ao), where ao is some fixed value. 1 The partic- 
ular value of ao is generally irrelevant for a PXEM 
iteration and is simply set to some convenient value 
throughout the iteration (e.g., ao = 1 for scale work- 
ing parameters and ao — for location working pa- 
rameters). In this regard, the PXEM iteration could 
be rewritten to avoid the dependence on ao, but it is 
generally deemed easier to simply set ao at one arbi- 
trary value and avoid potentially complex algebraic 
manipulations. The situation is similar when using 
marginal augmentation with an improper working 
prior distribution. In that case the posterior distri- 
bution of a is improper leading to the technical con- 
cerns discussed in Section 5.2. With PXEM the ob- 
served data likelihood does not depend on a which 



x We need not condition on a = cr*' in Q px because 
Q pyc (8,a\8',a') > Q p *(8' ,a!\8' ,a') implies p(8\Y obs ) > 
p(8'\Y° hB ) for any values of 8' and a'. In particular, 

Qpx(0 ( * +1) ,a (t+1) |0 (t) ,a o ) > Qpx(0 (t) ,ao|0 (t) ,a o ) implies 
p(8 (t+1) \Y° hs ) >p((9 (t) |Y obs ); see Liu, Rubin and Wu (1998) 
and van Dyk (2000a). 




Fig. 6. The PXEM Algorithm. [In the maximization step we 
compute 8 and a to maximize the conditional expectation of 
logp(#, a|y aug ), with the expectation computed in the expec- 
tation step.j 

can lead to numerical problems if the updated value 
of a is carried forward in the iteration. 

We expect PXEM to perform at least as well as 
an algorithm that fixes a (i.e., CDA-EM) in terms 
of the global rate of convergence because it essen- 
tially removes the conditioning on a in the data- 
augmentation scheme. Removing this conditioning 
reduces I aug (in a positive semidefinite ordering sen- 
se) and thus improves the rate of convergence of 
EM (see Meng and van Dyk, 1997, and Liu, Ru- 
bin and Wu, 1998, for details). It is in this regard 
that PXEM is an example of efficient data augmen- 
tation: it effectively reduces the augmented data in- 
formation in order to improve the rate of conver- 
gence without sacrificing simplicity or stability. This 
does not mean that PXEM generally dominates a 
CDA-EM algorithm because different augmentation 
schemes are used in the context of the two strate- 
gies. In particular, like marginal data augmentation, 
PXEM is generally implemented with a transfor- 
mation, y aug = £> a (y aug ). However, unlike that of 
conditional data augmentation, this transformation 
does not depend on 6; see Figure 6. Liu, Rubin and 
Wu (1998) give an alternative explanation for the ef- 
ficient performance of PXEM, that by fitting a, we 
are performing a covariance adjustment to capitalize 
on information in the data-augmentation scheme. 
They also illustrate the substantial computational 
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advantage PXEM can offer over other EM-type algo- 
rithms for ML estimation. In the context of Bayesian 
calculations, van Dyk and Tang (2003) show how 
one-step-late methods (Green, 1990) can be used to 
accomplish the required optimizations of the PXEM 
M-step. 

5.4 A Graphical Comparison of CDA-EM and 
PXEM 

To illustrate the differences between the CDA-EM 
and PXEM algorithms, we consider a simple Gaus- 
sian model. Suppose 

(18) X;~N(0,l/2) fori = l,...,n 
and 



(19) 



*i~N(0,l/2) fori = l,. 



where the X = (X\, . . . ,X n ) is observed and Y = 
(Yi, . . . ,Y m ) is completely missing. Obviously, the 
maximum likelihood estimate of 9 is X and the miss- 
ing Y is not relevant. Nonetheless, for illustration, 
we can construct an EM algorithm that treats Y as 
missing data. In particular, with (X,Y) being the 
augmented data, we have 



Q(9\0®) = 2B 



nX + Y f E(Yi\9®) 



(n + m) 



= 28{nX + m<9 w ) - (n + m)9 2 , 

which can be compared to the observed data log- 
likelihood, £(0), as in the first panel of Figure 7, 
where n = 1, m = 5, X = 0, and = 5. The panel 
illustrates that i(9) and Q(9\9^) have the same 
derivative at 0® and that their optimizers are the 

maximum likelihood estimate, 9* ', and ^em^ > res P ec_ 
tively. (For diagrams illustrating EM's iteration and 
rate of converge, see Navidi, 1997.) 

To use CDA-EM and PXEM, we introduce a work- 
ing parameter a, via the transformation, Z{ = Yi — 
a6 ~ N[(l — a)0, 1/2] for i = 1, . . . , m, and treat Z = 
(Z\, . . . , Z m ) as the missing data. Since a is not iden- 
tifiable given X , it is a valid working parameter. In 
this case, 



Q(9,a\9^\a') = 29 



nX + (1 - a)^E(Zi|0W,a') 



— [n + m(l — 
29[nX + m(l 

— [n + m(l — 



a 



a 



=1 
f 

a)(l 
) 2 }9 2 . 



a 



')0 W ] 



The method of conditional data augmentation re- 
quires J aug (a) = 2[n + m(l — a) 2 ] be computed by 
differentiating Q(0, a\0^\ a') twice with respect to 
9 and minimized it as a function of a. The optimal 
value occurs when a = 1, in which case the distri- 
bution of the missing data does not depend on 9. 
The second panel of Figure 7 compares Q a {0\9^) = 
Q(9, a\9^' , a) computed with several values of a with 
t(9). As a grows closer to one, grows closer 

to #mle- With the optimal value of a in this ex- 
ample, Q a (0\0®) and 1(9) coincide, and CDA-EM 
converges to 9* in one iteration. In general, the al- 
gorithm does not converge in one step, but the un- 
derlying strategy of choosing a working parameter 
so that Q a (9\9^) is closer to 1(9) is always the goal. 

For PXEM, a' is fixed at the identity value of the 
transformation from Y to Z (i.e., a' = 0) and 9 and 
a are updated at each iteration by jointly optimizing 
Q(9,a\9 {t \a' = 0). The third panel of Figure 7 plots 
this function using a heat map, where brighter colors 
represent higher values and darker colors represent 
lower values. The solid line superimposed on the plot 
is the optimal value of 9 function of a and is 
given by 



(20) 



Y J n i=l X i +m(l-a)9^ 
n + m(l — a) 2 



For example, with a = the curve gives 9^^ . The 
dashed line gives the optimal value of 9 as a func- 
tion of a under CDA-EM. This curve corresponds 
to the modes of the dashed curves in the second 
panel. The solid and dashed curves in the third panel 
differ because CDA-EM and PXEM differ in how 
they treat a' in Q(9, a\9^>, a'). PXEM fixes a' at 
the identity value under the transformation from Y 
to Z (i.e., PXEM fixes a' = 0), whereas CDA-EM 
does not update a in the iteration and sets a' = a 
throughout. The function Q(9, a\9^\a' = 0) plotted 
in panel 3 increases along the solid curve as a goes 
to — oo and the solid curve asymptotes to 9 = 0, the 
maximum likelihood estimate. Thus, both CDA-EM 
run with a = 1 and PXEM converge to the maxi- 
mum likelihood estimate in one iteration. 

One might be tempted to think that PXEM is su- 
perior to CDA-EM because it optimizes Q(9,a\9^\ 
a' = 0) over both 9 and a at each iteration, whereas 
CDA-EM optimizes Q(9,a\9^ ,a' = a) over only 9 
under a constraint that fixes a at a prespecified 
value. That is, one might expect PXEM to increase 
i more because it increases Q more. This reason- 
ing, however, not only blurs the difference in how 
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Fig. 7. Comparing the CDA-EM and the PXEM algorithms. The first panel compares 1(9) and Q(6\6') for an EM algorithm 
applied to a simple Gaussian problem. The functions are normalized to be tangent at 9^ . The second panel compares £(9) with 
Q(0, a|# ,a) for several values of a and shows how the missing data becomes less informative and Q(6, a|0 , a) becomes a 
better approximation to £(9) as a get closer to the optimal value, a = l. The final plot is a heat map of Q(9,a\9^\a' — 0) 
that is optimized in the PXEM algorithm. Lighter colors correspond to higher functional values. The function has two critical 
points, one at (9 = 0,a = 1) and one at (9 — 0,a = — oo). The solid and dashed curves give the optimal value of 9 as a 
function of a by maximizing Q(9,a\9 ( - t \a) and Q(9,a\9 i - t \a' = 0), respectively. The CDA-EM update is a saddle point of 
Q(9,a\9 {t) ,a' =0) and the PXEM update occurs in the limit as a — > —oo. Nonetheless, both algorithms return the maximum 
likelihood estimate in one iteration, 6» (t+1) = 0. 



the two algorithms treat a', but also oversimplifies 
the rates of convergence of EM-type algorithms. An 
algorithm that increases Q more at every iteration 
does not necessary converge faster. This can be seen 
clearly in the first panel of Figure 7. The optimal 
update is 9*, but 6* is far from the maximizer of 
Q. Our goal is not to increase Q more, but to make 
Q a better approximation of the log likelihood. As 



another example, the EM algorithm by definition in- 
creases Q by at least as much in its M-step as ECM 
can in a sequence of CM-steps. Nonetheless, Meng 
(1994) shows that ECM can converge faster than 
EM. In the present example, CDA-EM sets a = 1 
and updates 9 to = which is a saddle point of 

Q(9,a\0®,a/ = 0). Even though Q(9, a\9®, a' = 0) 
evaluated at the CDA-EM update is less than when 
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it is evaluated at the PXEM update, both updates 
have 0( t+1 > = and thus give the same value of the 
observed data log likelihood. The rate of conver- 
gence is more directly determined by (13) than by 
the relative increase in Q. It is this rate that CDA- 
EM aims to optimize and that PXEM improves by 
eliminating the conditioning on a; see Section 5.3. 

5.5 Nesting 

Nested EM and DA-type algorithms involve iter- 
atively using a data augmentation method to ac- 
complish one of the steps of a larger algorithm also 
involving data augmentation. Figures 8-10 illustrate 
three different ways this might be done. To motivate 
the nesting strategy, we begin with the partially- 
blocked Gibbs sampler illustrated in Figure 8 (van 
Dyk (2000b)). Although we consider a sampler com- 
posed using three full conditional distributions, the 
ideas apply immediately to samplers with arbitrarily 
many conditional distributions. In particular, sup- 
pose we wish to sample from p(0\Y ohs ), where = 
{01,02,03) by using a Gibbs sampler which samples 
from each of p(9i 1 2 , 03 , Y ohs ) , p{9 2 \ 0i , 6 3 , Y obs ) , and 
p{03\9\, 02, Y obs ) in turn. If sampling from p(0i\0 2 , 03, 
y obs ) is expensive relative to sampling from the other 
two conditional distributions, it may be beneficial 
to sample once from p(0i\0 2 ,03,Y ohs ) and then to 



p(0i\e 2 ,e 3 ) 











P(0 2 \0x,03) 




P(0 3 \01,02) 




p{0 


2, 03 


0i ) 



Random Draw 



Fig. 8. The partially blocked Gibbs sampler. The inner loop 
is iterated N times. 




p(Y{,Yf\9) 



p(e\Y?,Y£) 

J Expectation Step 
(^J) Maximization Step 
Random Draw 



Fig. 9. The MCEM algorithm. The inner loop is iterated 
several times. [In the maximization step we compute 6 to max- 
imize the conditional expectation of logp(S|y i aug , Y^^), with 
the expectation computed in the Monte Carlo expectation step. 
Here we use the superscript "a" as an abbreviation for "aug" 
or "augmented, "j 

sample from p{0 2 \0 1 ,0 z ,Y ohs ) and p{0 3 \0 1 ,0 2 ,Y ohs ) 
N times each in turn. If N is large, the internal 
Gibbs sampler delivers an approximate draw from 
the joint distribution p{02,03\0\)- If this approxima- 
tion is good, we are essentially running a blocked 
Gibbs sampler with conditional distributions p{0\\02, 
3 , Y obs ) and p(0 2 , 3 |6>i, Y obs ). The partially blocked 
Gibbs sampler is useful when the advantage of block- 
ing outweighs the cost of sampling from p{0 2 ,03\0\, 
y obs ) via a nested Gibbs sampler. This strategy may 
be helpful when 02 and 0% exhibit significant correla- 
tion given 9\ and/or p{0\\02, 03,Y ohs ) is particularly 
difficult to sample (e.g., van Dyk et al., 2001). No- 
tice there is a subtle tradeoff here. If 02 and #3 are 
(nearly) conditionally independent given 0\, then 
there is no need to run the inner iteration. If, on 
the other hand, they are highly correlated, then the 
inner iteration may need to be run many times in 
order to deliver a good draw. The key to success 
with this strategy is repeating the expensive draw 
of p(0\\02,03) as seldom as possible. 

In the context of the EM algorithm, we can im- 
plement a similar strategy when the augmented data 
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Random Draw 



Fig. 10. The nested EM algorithm with Monte-Carlo E-step 
implemented with a two-step Gibbs sampler. The inner loops 
are both iterated several times. [In the maximization step we 
compute 6 to maximize the conditional expectation oflogp(9\ 
Y^ ug , Y 2 aug ), with the expectation computed in the expectation 
steps, see Section 5.5. Here we use the superscript "a" as an 
abbreviation for "aug" or "augmented."] 

naturally divide into two or more parts. This strat- 
egy takes advantage of the fact that an EM algo- 
rithm that treats only part of Y aug as missing and 
collapses over the rest is faster in terms of pem 
(Meng and van Dyk, 1997). Thus, we aim to con- 
struct an EM algorithm using only part of Y aug . 
Although this algorithm typically does not have a 
closed form M-step, the maximization can be accom- 
plished by a second, typically closed-form, EM algo- 
rithm that treats the remainder of Y aug as missing 
data. The resulting nested EM algorithm (van Dyk, 
2000b) has an improved rate of convergence but, 
because of the nesting, each iteration requires more 
time to compute. If the computational complexity of 
the E-step is relegated to the outer loop, this trade- 
off can go in favor of the nesting strategy when con- 
sidering the actual computing time required. This 
advantage can be pronounced when the outer E-step 
requires a Gibbs sampler to compute the necessary 
conditional expectations. This is possible with the 
Monte Carol EM (MCEM) algorithm (Wei and Tan- 
ner (1990)), as is illustrated by van Dyk (2000b). 



The MCEM algorithm is compared with the nested 
EM algorithm in Figures 9 and 10. 

6. PARTIAL COLLAPSING AS 
A UNIFIED APPROACH 

While the partially-blocked nature of the sampler 
in Figure 8 is clear, the nested EM algorithm in 
Figure 10 partially removes Y aug \ C Y 2 aug from 
the data augmentation scheme in the spirit of condi- 
tional augmentation. In this regard, the nested EM 
algorithm is a type of "partially collapsed" EM algo- 
rithm. In this section we discuss a different strategy 
for partially collapsing quantities out of an EM or 
DA algorithm. In particular, in algorithms that in- 
volve model reduction, we can collapse quantities 
in some but not all of the CM-steps or conditional 
draws. It is in this sense that we use the term "par- 
tially collapsed." 

Collapsing involves constructing an algorithm on 
a marginal distribution of the target space of the 
original algorithm. That is, we construct an algo- 
rithm that works on a collapsed parameter space of 
the original parameter space. (Here the parameter 
space includes all unknowns including latent vari- 
ables and missing data.) Although this strategy is 
computational efficient it can be practically diffi- 
cult if some or all of the full conditional distribu- 
tions on the collapsed parameter space are complex 
or nonstandard distributions. Given that the aug- 
mented data are introduced specifically to simplify 
the full conditional distributions, it is not surpris- 
ing that reducing that augmented data can sacrifice 
this simplicity. Partially collapsed methods aim to 
reap some of the gains of collapsing in this situ- 
ation. In particular, when some of the conditional 
distributions on the collapsed parameter space are 
simple or at least no more complicated that the cor- 
responding conditional distribution of the original 
parameter space, partially collapsed methods mix 
conditional distributions from the two (or perhaps 
more) parameter spaces in the construction of EM- 
type algorithms and DA-type samplers. For exam- 
ple, if a conditional maximization or draw given the 
augmented data are not easier than the correspond- 
ing maximization or draw given the observed data, 
then we may as well use the version that does not 
involve data augmentation, that is the collapsed ver- 
sion. As we shall discuss, this strategy has lead to a 
number of useful algorithms. 
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6.1 The ECME and AECM Algorithms 

In order to improve the rate of convergence of the 
ECM algorithm, Liu and Rubin (1995) formulated 
the Expectation Conditional Maximization Either 
or ECME algorithm in which they suggest replacing 
one or more of the CM-steps of the ECM algorithm 
with 

Direct CM-step p: Set #(*+p/ p ) = argmax e \ogp(8\ 

y^obs^ 

subject to 6 { ^ p/P) = 6 { ^ {p ~ 1)/P) . When an itera- 
tive method is required to accomplish one or more 
of the CM-step of ECM, it is often no more diffi- 
cult to maximize the conditional log posterior di- 
rectly without recourse to data augmentation. In 
this case Liu and Rubin (1995) argue that the direct 
CM-step is expected to improve convergence with- 
out complicating implementation. We recognize this 
as a partially collapsed algorithm. If all of the ECM 
CM-steps were replaced by direct CM-steps the aug- 
mented data would be completely removed from the 
iteration. This would collapse ECM into a Gauss- 
Seidel optimizer, which is generally expected to be 
faster than ECM. Of course, if some of the CM-steps 
of ECM are simple closed-form optimizations while 
those of ECME require numerical optimization, the 
computational tradeoff can easily favor ECM over 
Gauss-Seidel. 

Meng and van Dyk (1997) set up a more gen- 
eral framework by allowing different levels of aug- 
mented data in each CM-step. The resulting algo- 
rithm is called the Alternating Expectation Condi- 
tional Maximization or AECM algorithm and gener- 
alizes both the ECME and the SAGE (Fessler and 
Hero (1994)) algorithms. In particular, Meng and 
van Dyk suggest replacing the CM-step of ECM 
with 

CM-step p: Set #(*+p/ p ) = argmax e E[logp(#| 
g p {Y^))\e^P- l )l p ^} 

subject to 9^ p/P) = 9^ p - 1)/P) . Here we have ex- 

panded Q(8\6^) according to its original definition 
with two important changes. First, Y au s is replaced 
by some function g p of y au s. This allows us to reduce 
the data augmentation by differing amounts in each 
of the P CM-steps. Here we assume <? p (y aug ) is a 
legitimate data augmentation scheme for each p. In 
particular, Y ohs is part of each g p (Y a ' ug ). Second, be- 
cause the data augmentation varies among the CM- 
steps, we must compute and E-step each time the 




CYCLE ONE 

Conditional 
v_y Maximization Step 

Fig. 11. A two-cycle AECM algorithm. [In the conditional 
maximization steps, we compute the component of 8 — (61,82) 
to maximize the conditional expectation of the log of the Quan- 
tity in the Q), with the expectation computed in the most recent 
expectation step, see Section 6.1.] 

data augmentation changes, see Figure 11. Thus, in 
the expectation of each AECM CM-step we condi- 
tion on the value of produced by the most recent 
CM-step, not the value produced at the end of the 
previous iteration. If the data augmentation is the 
same for several consecutive CM-steps (i.e., if g p is 
the same) we need only recompute the E-step at the 
beginning of this sequence. The same requirement 
holds for ECME in that the steps must be appropri- 
ately ordered relative to the E-step. The CM-steps 
that involve data augmentation must all follow the 
E-step and be performed before any of the CM-steps 
that do not involve data augmentation, unless the E- 
step is repeated. These step-ordering requirements 
are necessary to ensure monotone convergence of the 
ECME and AECM algorithms (Meng and van Dyk, 
1997). As we discuss next, similar step-ordering re- 
quirements apply to the partially collapsed Gibbs 
sampler. 

6.2 The Partially Collapsed Gibbs Sampler 

Consider the two-step data augmentation sampler 
described in Section 3.1. To clarify ideas, we rewrite 
this sampler with Y aus replaced by ip and with the 
conditioning on Y obs suppressed: 
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Step 1: V (m) ~p(^|0 (t) ), 
Step 2: ~;p(#|V> (m) ). 

Under the standard regularity conditions, we expect 
that after sufficient burn-in this sampler will effec- 
tively return correlated draws from its stationary 
distribution, p(ip,9). In order to speed up conver- 
gence to stationarity and reduce the correlation of 
the draws, we might take a cue from ECME and 
AECM and attempt to partially collapse the sam- 
pler. In particular, suppose we want to reduce the 
conditioning in Step 2. A reasonable and optimal 
strategy might seem to be the following: 

Step 1: V (m) 

Step 2: 6>(* +1 ) ~ p (0). 

Clearly, and are independent and the 

stationary distribution of this sampler is p(ip)p{9) 
which is generally different than the target distribu- 
tion, p(ijj,0). In this simple example, we need only 
change the order of the two steps to regain a chain 
with the target distribution as its stationary dis- 
tribution. Nonetheless, three important cautionary 
facts regarding partially collapsed Gibbs samplers 
are illustrated by this simple example. 

First, the "full conditional distributions" of the 
partially collapsed sampler may not be compatible 
with any joint distribution. In the simple example, 
this is illustrated by the fact that one cannot find a 
joint distribution of (ip,0) such that ijj depends on 
6 but 9 is independent of ip. This incompatibility 
means that we have left the standard Gibbs sam- 
pler framework and that standard results as well as 
our intuition may fail. Second, as with ECME and 
AECM, the order of the steps may matter. Even in 
this simple case, the stationary distribution of the 
chain depends on the order of the steps. 

Finally, the steps can sometimes be blocked to 
form a standard sampler. If we first draw 9 from 
its marginal distribution and then ip from its condi- 
tional distribution given 9, we are directly sampling 
from the joint distribution, and have thus blocked 
the two steps. In fact, blocking is a special case of 
partially collapsing. It is easy, however, to construct 
cases where partially collapsed samplers do not cor- 
respond to any blocked version of the ordinal sam- 
pler (van Dyk and Park, 2008; Park and van Dyk, 
2009). 

Given these cautionary facts, it is clear that care 
must be taken when partially collapsing a Gibbs 
sampler. Van Dyk and Park (2008) give a prescrip- 
tive method for construction such samplers that are 



guaranteed to maintain the target stationary distri- 
bution. They also argue that like blocking, partial 
collapsing improves the convergence characteristics 
of the chain, but not as much as complete collapsing. 
This, along with the fact that blocking is a special 
case of complete collapsing, unifies the blocking and 
collapsing strategies. Generally, blocking is not as ef- 
ficient as collapsing because blocking is only partial 
collapsing. 

7. REFINED ALGORITHMS FOR 
THE SPECTRAL MODEL 

By far the most computationally intensive aspects 
of the EM and DA algorithms for the spectral model 
described in Sections 3.2 and 3.4 are the removal of 
the background counts and the deblurring of the 
source counts, that is, computing the conditional 
expectation of or sampling and Yj~ for i € X 
and j G J. These tasks involve looking up values 
in the typically large matrix, M, a time-consuming 
task even when sophisticated sparse-matrix tech- 
niques are implemented. Given the computation cost 
of these steps and the hierarchical structure of the 
data augmentation, nesting is an obvious strategy. 
As an illustration, we implement a nested EM al- 
gorithm. In this algorithm we start by setting Y^ ns 
equal to Y^ for j € J . Because this augmentation is 
smaller than the complete data-augmentation scheme 
outlined in Table 1, fewer iterations of the EM algo- 
rithm are required. Because there is less augmented 
data, however, the M-step is not in closed form. 
Thus, we implement an inner EM algorithm to ac- 
complish the M-step of the outer EM algorithm. 
This strategy is similar to the algorithm illustrated 
in Figure 10, except the outer E-step does not re- 
quire a Gibbs sampler but is nonetheless computa- 
tionally demanding. The inner EM iteration fixes 
Y^ us and updates only the first three rows of Ta- 
ble 1 in the inner E-step and 9 in the M-step. If 
this inner EM converges slowly (e.g., there are many 
and/or weak emission lines), a relatively large num- 
ber of inner iterations (e.g., 10) may substantially 
improve the speed of the algorithm. The outer E- 
step updates all of Y au s . 

The advantage of nesting is illustrated using a 
spectrum of the high redshift quasar S5 0014 + 81 
collected with the Chandra X-ray Observatory as 
described by Elvis et al. (1994). The spectrum is 
modeled using a power law continuum, f(9 c ,Ej) = 
-yEJ 13 , exponential absorption, g(9 A , Ej) = e^/ E i , and 
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Fig. 12. Various EM-type algorithms for fitting the spectral 
model. The figure illustrates the computational advantage of 
nesting and conditional augmentation. All five plots show the 
convergence of the parameter v, the expected line count, as a 
function of C.P.U. time in seconds. The five plots correspond 
to the standard EM algorithm based on the data-augmentation 
scheme outlined in Table 1 (solid line); the nested EM algo- 
rithm (dotted line); the CDA-EM algorithm (dashed line); an 
algorithm that combines nesting and CDA-EM (dotted-dashed 
line); and a close up of the first 300 seconds comparing all but 
the standard EM algorithm. The solid horizontal line in each 
plot is the MLE of v. The nested and CDA-EM used here are 
described in Section 7. 



a single Gaussian emission line with location, width, 
and intensity parameters 2 for a total of six free pa- 
rameters. The first two panels of Figure 12 show 
the convergence of the expected counts attributed 
to the line, for the EM and nested EM algorithms, 
respectively. The nested EM algorithm (run with 
4 inner iterations) converges in about a third of 
the time required by the standard EM algorithm. 
The remaining panels in Figure 12 will be described 
shortly. 

To further improve the convergence of the algo- 
rithms, we can reduce the augmented information 
for 6 using the method of conditional augmentation. 
In particular, we reduce the counts attributed to the 
absorbed photons in the emission line, Yf'-Yf'. Re- 
call that absorption does not occur uniformly across 
the range of energies of an emission line, and the 
energies of the observed photons are biased towards 
areas of low absorption, complicating parameter es- 
timation. Our typical strategy, as described in Ta- 
ble 1, is to treat the absorbed photons as missing 
data. Thus, in the augmented data, there is no ab- 
sorption. It is important to note, however, that we 
need not account for (i.e., augment) all of the ab- 
sorbed photons, rather we only need the absorption 
rate to be constant across the support energies of 
the emission line. Thus, a better strategy is to aug- 
mented fewer absorbed photons, just enough so that 
the absorption rates are equal across the range of 
energies of an emission line. In particular, suppose 
Omin is the lowest absorption rate, 1 — djg(9 A , Ej), 
where j varies over the support of the emission line. 
To reduce the volume of the augmented data, we 
can compute Y-f acting as if the absorption rate 



were 1 — djg(9 ,Ej 



Here a m - m is the op- 



timal value of a working parameter, and we con- 
dition on it throughout. In this way, we add fewer 
counts to each bin. As an extreme example, con- 
sider a delta function emission line that is contained 
entirely within a single energy bin. In this case, 
the support of the emission line is one bin, a m \ n = 
1 — djg(9 A ,Ej) with j the index of the bin contain- 
ing the line, 1 — djg(9 A ,Ej) — a m ; n is zero, and we 
need not impute any missing counts to account for 
absorption in the line. We emphasize that this does 
not change the model being fit, it only improves the 



2 A Gaussian emission line is parameterized as ^(f>( ). 
where 4> is the standard normal probability density function, 
/j, is the line location, a is the line width, and v is the line 
intensity. 
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efficiency of the computation. This strategy is used 
in the CDA-EM algorithm and is combined with 
nesting in the nested CDA-EM algorithm; both algo- 
rithms are illustrated in Figure 12. The nested EM 
algorithm and the CDA-EM algorithm (coinciden- 
tally) require similar computation time, combining 
the two strategies, however, is twice as fast as either 
alone. The final panel in Figure 12 is a more detailed 
comparison of the three improved algorithms. These 
algorithms are discussed and further illustrated in 
van Dyk and Kang (2004). 

Other strategies described in this article lead to 
additional improvements. The posterior distribution 
or likelihood of the location of a narrow emission 
line, for example, is typically highly multimodal. 
The Poisson nature of the data leads to small en- 
ergy ranges with more counts than expected. These 
correspond to possible locations of a narrow emis- 
sion line and may be relatively large modes of the 
likelihood if the actual line is weak. The standard 
EM and DA algorithms described here are not able 
to jump between these modes because line location 
is updated while conditioning on which photons are 
attributed to that line. Thus, the line location will 
be among the energies of these photons and only 
photons in this energy range will be attributed to 
the line in the next step. To get around this, van 
Dyk and Park (2004) and Park and van Dyk (2009) 
suggest EM-type and DA-type samplers that remove 
the conditioning on all or part of the augmented 
data while updating the line locations. The result is 
ECME and AECM algorithms for mode finding and 
partially collapsed Gibbs samplers for posterior ex- 
ploration, all of which are much more efficient than 
the standard EM and DA algorithms. 



explicitly or implicitly to derive numerous efficient 
EM-type and DA-type algorithms with applications 
to a wide range of models including longitudinal 
data analysis for binary response and robust meth- 
ods, robust regression, binary and grey-level Ising 
models, dynamic linear models, finite mixture mod- 
els, Poisson image analysis, probit regression, multi- 
nomial probit models, switching-state space models, 
factor analysis, spectral analysis, etc. A small subset 
of examples can be found in Liu and Rubin (1994, 
1995), Gelfand, Sahu and Carlin (1995), Meng and 
van Dyk (1997, 1998, 1999), van Dyk and Tang (2003), 
van Dyk and Park (2004), Higdon (1998), Pilla and 
Lindsay (2001), Liu, Rubin and Wu (1998), van Dyk 
(2000a, 2000b), Liu and Wu (1999), van Dyk and 
Meng (2001), Foulley and van Dyk (2000), van Dyk 
and Kang (2004), Imai and van Dyk (2005a, 2005b), 
Gelman et al. (2008), Pope and Wong (2005) and 
Ghosh and Dunson (2009). We hope that this over- 
view paper will help to both further stimulate method- 
ological research and promote efficient implementa- 
tion of EM-type and DA-type algorithms in prac- 
tice. In other words, to paraphrase the title, we hope 
practitioners will have an easier time to climb likeli- 
hood surfaces using EM-type algorithms and to ex- 
plore posterior landscape using DA-type samplers. 
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8. CONCLUDING REMARKS 

The highly flexible nature of multilevel model- 
ing inhibits an off-the-shelf algorithmic approach to 
model fitting. However, the flexibility of a dynamic 
combination of data augmentation and model reduc- 
tion give us tools to tackle these models. As illus- 
trated in the spectral model, the many recent exten- 
sions and refinements of data augmentation meth- 
ods can substantially improve computational speed 
while maintaining simplicity and stable convergence, 
thus greatly extending the applicability and power 
of data augmentation . 

The data-augmentation and model-reduction strate 
gies outlined in this article have been used either 
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